API: Reformat output of groupby.describe (#4792) #15260

mroeschke · 2017-01-30T07:12:15Z

closes describe on a groupby #4792
tests added / passed
passes git diff upstream/master | flake8 --diff

Doesn't look like this was addressed in a PR in 0.20, but the original issue works on master.

jreback · 2017-01-30T11:33:22Z

pandas/tests/groupby/test_groupby.py

+                           'VOLUME': volumes})
+        result = df.groupby('PRICE').describe()
+        expected_index = pd.MultiIndex(levels=[[24990, 25499],
+                                               ['count', 'mean', 'std',


Instead of constructing it this way, use concat with keys on the subframes.
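
A minimal sketch of that suggestion, assuming the frame from #4792 looks like the one reconstructed below (the values are inferred from the output quoted later in this thread, not copied from the test file). Each sub-frame is described on its own, and `pd.concat` builds the two-level index from `keys`:

```python
import pandas as pd

# Assumed reconstruction of the test frame; values inferred from the thread.
df = pd.DataFrame({'PRICE': [24990, 25499, 25499],
                   'VOLUME': [1500000000, 5000000000, 100000000]})

prices = [24990, 25499]
# Describe each sub-frame separately ...
subframes = [df.loc[df['PRICE'] == p, ['VOLUME']].describe() for p in prices]
# ... and let concat build the (PRICE, statistic) MultiIndex from the keys.
expected = pd.concat(subframes, keys=prices)
expected.index.names = ['PRICE', None]
print(expected)
```

This produces the long layout (statistics on the index), which is what the test expected at this point in the review.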

jreback · 2017-01-30T14:20:29Z

Actually I think the bug is still present. The issue is that you want this

In [10]: df.groupby('PRICE').VOLUME.describe().unstack(1)
Out[10]: 
       count          mean           std           min           25%           50%           75%           max
PRICE                                                                                                         
24990    1.0  1.500000e+09           NaN  1.500000e+09  1.500000e+09  1.500000e+09  1.500000e+09  1.500000e+09
25499    2.0  2.550000e+09  3.464823e+09  1.000000e+08  1.325000e+09  2.550000e+09  3.775000e+09  5.000000e+09

rather than this

In [9]: df.groupby('PRICE').VOLUME.describe()
Out[9]: 
PRICE       
24990  count    1.000000e+00
       mean     1.500000e+09
       std               NaN
       min      1.500000e+09
       25%      1.500000e+09
       50%      1.500000e+09
       75%      1.500000e+09
       max      1.500000e+09
25499  count    2.000000e+00
       mean     2.550000e+09
       std      3.464823e+09
       min      1.000000e+08
       25%      1.325000e+09
       50%      2.550000e+09
       75%      3.775000e+09
       max      5.000000e+09
Name: VOLUME, dtype: float64

we do a similar unstack already with .ohlc.

In [8]: df.groupby('PRICE').VOLUME.ohlc()
Out[8]: 
             open        high         low       close
PRICE                                                
24990  1500000000  1500000000  1500000000  1500000000
25499  5000000000  5000000000   100000000   100000000

So each group gets a single row, while multi-level columns are present for multiple aggregations.
A multi-index is present for multiple groupers.
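
A small sketch of both shapes, assuming the same reconstructed frame as in the session above. The `SIDE` column is hypothetical and exists only to provide a second grouper:

```python
import pandas as pd

# Assumed reconstruction of the frame used in this thread.
df = pd.DataFrame({'PRICE': [24990, 25499, 25499],
                   'VOLUME': [1500000000, 5000000000, 100000000]})

# One grouper, several aggregations: one row per group, statistics in columns.
ohlc = df.groupby('PRICE')['VOLUME'].ohlc()

# Hypothetical second key column, added only to illustrate multiple groupers.
df['SIDE'] = ['buy', 'buy', 'sell']
# Two groupers give a two-level row index; a list of aggregations on a
# DataFrame gives two-level columns (column name, statistic).
by_two = df.groupby(['PRICE', 'SIDE']).agg(['mean', 'max'])

print(ohlc)
print(by_two)
```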

mroeschke · 2017-01-30T21:36:38Z

Ah that makes sense. Thanks for clarifying the expected output!

I've been poking into the code, and since each group goes through apply() and describe() returns the metrics labeled on the index, it tries to vertically concat the describe() results for each group:

(Pdb) values
[             VOLUME
count  1.000000e+00
mean   1.500000e+09
std             NaN
min    1.500000e+09
25%    1.500000e+09
50%    1.500000e+09
75%    1.500000e+09
max    1.500000e+09,              
              VOLUME
count  2.000000e+00
mean   2.550000e+09
std    3.464823e+09
min    1.000000e+08
25%    1.325000e+09
50%    2.550000e+09
75%    3.775000e+09
max    5.000000e+09]

I could add some logic saying if the indexes for each group are the same, concat on the index and transpose to the columns, but I think that'd be a pretty big change since it will probably affect all groupby.apply() functions. Or should we make a special case for describe()? Thoughts @jreback?

jreback · 2017-01-30T21:48:53Z

so you just need to define .describe as a method on DataFrameGroupBy and SeriesGroupBy, then you can do the .apply (followed by the unstack).

Right now we automagically just do any whitelisted function (including .describe) and they go thru the standard reshaping things. If you can figure out a generic way to do this great, but otherwise defining a method is fine.
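
The apply-then-unstack idea can be sketched outside of pandas internals roughly as follows. `describe_wide` is a hypothetical helper name, not the method that was merged, and the frame is again an assumed reconstruction of the one in this thread:

```python
import pandas as pd


def describe_wide(series_groupby):
    """Hypothetical helper, not the pandas implementation.

    Describe each group, then move the statistic labels from the
    inner index level to the columns.
    """
    stacked = series_groupby.apply(lambda group: group.describe())
    return stacked.unstack()


# Assumed reconstruction of the frame used in this thread.
df = pd.DataFrame({'PRICE': [24990, 25499, 25499],
                   'VOLUME': [1500000000, 5000000000, 100000000]})

wide = describe_wide(df.groupby('PRICE')['VOLUME'])
print(wide)
```

The explicit `apply` keeps the sketch independent of whatever layout the built-in `describe` of a given pandas version returns.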

mroeschke · 2017-02-01T07:39:34Z

Cool, thanks for the insight.

I defined a new method for groupby.describe() and noticed in the process that if groupby(...,axis=1).describe() is called, transposing returns the desired result instead of unstacking. I also had to change a lot of the existing tests since the output has changed. Most notably, the test for #14848 changed a lot.

jreback · 2017-02-01T14:10:06Z

doc/source/whatsnew/v0.20.0.txt

@@ -366,6 +366,7 @@ Other API Changes
 - ``inplace`` arguments now require a boolean value, else a ``ValueError`` is thrown (:issue:`14189`)
 - ``pandas.api.types.is_datetime64_ns_dtype`` will now report ``True`` on a tz-aware dtype, similar to ``pandas.api.types.is_datetime64_any_dtype``
 - ``DataFrame.asof()`` will return a null filled ``Series`` instead the scalar ``NaN`` if a match is not found (:issue:`15118`)
+ - ``groupby.describe()`` now labels the `describe()` metrics in the column instead of the index (:issue:`4792`)


I think move this to a sub-section and show previous and current behavior.

jreback · 2017-02-01T14:11:50Z

pandas/core/groupby.py

@@ -1140,6 +1139,17 @@ def ohlc(self):

    @Substitution(name='groupby')
    @Appender(_doc_template)
+    def describe(self, **kwargs):
+        """
+        Provide summary statistics for each group, excluding NaN values


pls add a full doc-string, with named arguments. You might be able to simply add Series.describe and DataFrame.describe in the Notes section.

or as you do below just use the DataFrame doc-string

mroeschke · 2017-02-03T07:17:50Z

Added previous/current behavior in whatsnew, documentation to describe with DataFrame.describe.__doc__, and fixed other failing tests.

jreback · 2017-02-03T13:53:55Z

doc/source/whatsnew/v0.20.0.txt

+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The output formatting of ``groupby.describe()`` now labels the ``describe()`` metrics in the columns instead of the index.
+This format is consistent with ``groupby.ohlc()`` (:issue:`4792`)


more to the point, it's consistent with how .agg() works.

most people prob don't know about .ohlc() :>

jreback · 2017-02-03T13:54:43Z

doc/source/whatsnew/v0.20.0.txt

+
+New Behavior:
+
+.. code-block:: ipython


use an ipython-block here (so the code executes)

jreback · 2017-02-03T13:59:17Z

pandas/tests/groupby/test_groupby.py

-        expected.index.names = ['A', None]
+        expected = pd.concat([(df[df.A == 1].B
+                                            .describe()
+                                            .to_frame()


This seems like lots of reshapings (this is current master)

In [6]: df
Out[6]: 
   A    B    C
0  1  2.0  foo
1  1  NaN  bar
2  3  NaN  baz

In [7]: df.groupby('A').describe()
Out[7]: 
           B
A           
1 count  1.0
  mean   2.0
  std    NaN
  min    2.0
  25%    2.0
  50%    2.0
  75%    2.0
  max    2.0
3 count  0.0
  mean   NaN
  std    NaN
  min    NaN
  25%    NaN
  50%    NaN
  75%    NaN
  max    NaN

In [8]: df.groupby('A').describe().unstack()
Out[8]: 
      B                                   
  count mean std  min  25%  50%  75%  max
A                                         
1   1.0  2.0 NaN  2.0  2.0  2.0  2.0  2.0
3   0.0  NaN NaN  NaN  NaN  NaN  NaN  NaN

In this test, the result = df.groupby('A').describe().unstack() after unstack() was added to groupby.describe(). Shouldn't expected follow an independent path to the result?

yes, but you can simply construct this result directly (as it's 'simple' enough), just pd.DataFrame(.....)
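
A hedged sketch of what constructing the result directly could look like, using the values from the unstacked output quoted above. This is an illustration, not the code that ended up in the test suite:

```python
import numpy as np
import pandas as pd

stats = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Values copied from the unstacked output in this thread: group 1 has a
# single non-null B value (2.0), group 3 has none.
expected = pd.DataFrame(
    [[1.0, 2.0, np.nan, 2.0, 2.0, 2.0, 2.0, 2.0],
     [0.0] + [np.nan] * 7],
    index=pd.Index([1, 3], name='A'),
    columns=pd.MultiIndex.from_product([['B'], stats]),
)
print(expected)
```

Building the frame by hand keeps `expected` independent of the reshaping code under test.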

Ah I see, thanks! Agreed that my first edit probably included too much reshaping.

mroeschke · 2017-02-07T06:23:39Z

Replaced API example using groupby.agg() instead of groupby.ohlc(), fixed the code block, and simplified expected in test_non_cython_api

jreback · 2017-02-07T15:34:25Z

hmm, something is wrong here. It's including the grouper.

In [1]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

In [2]: df.groupby('A').describe()
Out[2]: 
      A                                        B                                          
  count mean  std  min  25%  50%  75%  max count mean       std  min   25%  50%   75%  max
A                                                                                         
1   2.0  1.0  0.0  1.0  1.0  1.0  1.0  1.0   2.0  1.5  0.707107  1.0  1.25  1.5  1.75  2.0
2   2.0  2.0  0.0  2.0  2.0  2.0  2.0  2.0   2.0  3.5  0.707107  3.0  3.25  3.5  3.75  4.0

In [3]: df.groupby('A').agg(['mean', 'std'])
Out[3]: 
     B          
  mean       std
A               
1  1.5  0.707107
2  3.5  0.707107

mroeschke · 2017-02-07T19:39:16Z

It looks like the grouper is included when using groupby.apply():

In [4]: pd.__version__
Out[4]: u'0.19.2' #not current master

In [5]: df.groupby('A').apply(lambda x: x.describe()) 
Out[5]:
           A         B
A
1 count  2.0  2.000000
  mean   1.0  1.500000
  std    0.0  0.707107
  min    1.0  1.000000
  25%    1.0  1.250000
  50%    1.0  1.500000
  75%    1.0  1.750000
  max    1.0  2.000000
2 count  2.0  2.000000
  mean   2.0  3.500000
  std    0.0  0.707107
  min    2.0  3.000000
  25%    2.0  3.250000
  50%    2.0  3.500000
  75%    2.0  3.750000
  max    2.0  4.000000

Can be seen with other functions as well:

In [11]: df.groupby('A').apply(np.mean) #not idiomatic but should be similar to  df.groupby('A').mean()
Out[11]:
     A    B
A
1  1.0  1.5
2  2.0  3.5

Is this a known issue?

jreback · 2017-02-07T19:48:43Z

@mroeschke so by-definition this is what apply does.

you can use ._set_group_selection() to avoid this problem.
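
`_set_group_selection()` is a private helper, so the sketch below does not call it. It only illustrates the underlying point with public API, assuming the small frame from the example above: each group's sub-frame still carries the grouping column, and selecting the non-key columns before describing keeps the grouper out of the result.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

# Each group's sub-frame still contains the grouping column 'A', which is
# why a function applied to the whole sub-frame also describes 'A'.
first_key, first_group = next(iter(df.groupby('A')))
print(list(first_group.columns))

# Public-API stand-in for the private helper: select the non-key columns
# explicitly before describing each group.
pieces = {key: group[['B']].describe() for key, group in df.groupby('A')}
wide = pd.concat(pieces).unstack()
print(wide)
```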

mroeschke · 2017-02-07T19:52:33Z

Ah thanks for clarifying that @jreback. Will edit to use ._set_group_selection() tonight.

Restructure describe def
Fix another test
Refactoring tests
linting & patch groupby tests
add whatsnew
fix docstring
fix more tests
Added api example and documentation to describe
fix potential pep8 complaint
adjust doc description
renamed original test and add agg example in doc
simplify example
Eliminate grouper from result
simplify example in the whatsnew

codecov-io · 2017-02-09T07:11:23Z

Codecov Report

No coverage uploaded for pull request base (master@c23b1a4).

@@            Coverage Diff            @@
##             master   #15260   +/-   ##
=========================================
  Coverage          ?   86.32%           
=========================================
  Files             ?      141           
  Lines             ?    51177           
  Branches          ?        0           
=========================================
  Hits              ?    44180           
  Misses            ?     6997           
  Partials          ?        0

Impacted Files	Coverage Δ
pandas/core/groupby.py	`95.13% <92.85%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c23b1a4...618bc46.

mroeschke · 2017-02-10T00:06:23Z

Added _set_group_selection() to prevent the grouper from being included and resolved conflict with whatsnew

jreback · 2017-02-10T00:19:32Z

thanks!

keep em coming!

closes pandas-dev#4792

Author: Matt Roeschke <[email protected]>
Author: Matthew Roeschke <[email protected]>

Closes pandas-dev#15260 from mroeschke/fix_4792 and squashes the following commits:

618bc46 [Matthew Roeschke] Merge branch 'master' into fix_4792
184378d [Matt Roeschke] TST: groupby.describe levels don't appear as column (pandas-dev#4792)

jreback requested changes Jan 30, 2017


jreback added the Groupby, Reshaping, and Testing labels Jan 30, 2017

jreback added this to the 0.20.0 milestone Jan 30, 2017

jreback removed the Testing label Jan 30, 2017

mroeschke changed the title ~~TST: groupby.describe levels don't appear as column (#4792)~~ [WIP] : Reformat output of groupby.describe (#4792) Jan 31, 2017

jreback mentioned this pull request Jan 31, 2017

Inconsistent handling of index after groupby operation #15272

Closed

mroeschke force-pushed the fix_4792 branch from 74bfd2f to dfae3f2 Compare February 1, 2017 07:32

mroeschke force-pushed the fix_4792 branch from dfae3f2 to 231d441 Compare February 1, 2017 07:43

jreback mentioned this pull request Feb 1, 2017

DataFrameGroupBy.idxmin() returns DataFrame, documentation says Series #15275

Closed

jreback requested changes Feb 1, 2017


mroeschke force-pushed the fix_4792 branch 2 times, most recently from 4b5d367 to 4375710 Compare February 3, 2017 07:17

jreback requested changes Feb 3, 2017


jreback mentioned this pull request Feb 4, 2017

Groupby transform idxmax return floats #15306

Closed

mroeschke force-pushed the fix_4792 branch from 4375710 to 7bf7771 Compare February 6, 2017 20:24

mroeschke changed the title ~~[WIP] : Reformat output of groupby.describe (#4792)~~ API: Reformat output of groupby.describe (#4792) Feb 6, 2017

mroeschke force-pushed the fix_4792 branch from 7bf7771 to 9643f01 Compare February 6, 2017 22:35

jreback approved these changes Feb 7, 2017


mroeschke force-pushed the fix_4792 branch from 9643f01 to 184378d Compare February 9, 2017 07:11

Merge branch 'master' into fix_4792

618bc46

jreback closed this in 3d6fcdc Feb 10, 2017

mroeschke deleted the fix_4792 branch December 20, 2017 02:04
